ABSTRACT

A new language model for speech recognition inspired by lin- 
guistic analysis is presented. The model develops hidden hierar- 
chical structure incrementally and uses it to extract meaningful 
information from the word history — thus enabling the use of 
extended distance dependencies — in an attempt to complement 
the locality of currently used trigram models. The structured lan- 
guage model, its probabilistic parameterization and performance 
in a two-pass speech recognizer are presented. Experiments on 
the SWITCHBOARD corpus show an improvement in both per- 
plexity and word error rate over conventional trigram models. 



1. INTRODUCTION 

The main goal of the present work is to develop and evaluate 
a language model that uses syntactic structure to model long-distance
dependencies. The model we present is closely related
to the one investigated in M\, however different in a few impor- 
tant aspects: 

• our model operates in a left-to-right manner, allowing the de- 
coding of word lattices, as opposed to the one referred to pre- 
viously, where only whole sentences could be processed, thus 
reducing its applicability to N-best list re-scoring; the syntactic 
structure is developed as a model component; 

• our model is a factored version of the one in M\, thus enabling 
the calculation of the joint probability of words and parse struc- 
ture; this was not possible in the previous case due to the huge 
computational complexity of that model. 

The structured language model (SLM), its probabilistic pa- 
rameterization and performance in a two-pass speech recognizer 
— we evaluate the model in a lattice decoding framework — are 
presented. Experiments on the SWITCHBOARD corpus show 
an improvement in both perplexity (PPL) and word error rate 
(WER) over conventional trigram models. 

2. STRUCTURED LANGUAGE MODEL 

An extensive presentation of the SLM can be found in [pp. The 
model assigns a probability $P(W,T)$ to every sentence $W$ and
its every possible binary parse $T$. The terminals of $T$ are the
words of $W$ with POStags, and the nodes of $T$ are annotated
with phrase headwords and non-terminal labels. Let $W$ be a
sentence of length $n$ words to which we have prepended <s>
and appended </s> so that $w_0 =$ <s> and $w_{n+1} =$ </s>. Let
$W_k$ be the word k-prefix $w_0 \ldots w_k$ of the sentence and $W_k T_k$
the word-parse k-prefix. Figure 1 shows a word-parse k-prefix;
$h_0 \ldots h_{-m}$ are the exposed heads, each head being a
pair (headword, non-terminal label), or (word, POStag) in the
case of a root-only tree.

Figure 1. A word-parse k-prefix: exposed heads $h_{-m} =$ (<s>, SB), ..., $h_0$ over the tagged words (<s>, SB) ... $(w_p, t_p)$ $(w_{p+1}, t_{p+1})$ ... $(w_k, t_k)$, followed by the remaining words $w_{k+1}$ ... </s>

Figure 2. Result of adjoin-left under NTlabel: $h'_0 = (h_{-1}.word, NTlabel)$

Figure 3. Result of adjoin-right under NTlabel: $h'_{-1} = h_{-2}$, $h'_0 = (h_0.word, NTlabel)$
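The effect of the two adjoin operations on the exposed heads can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `adjoin_left` and `adjoin_right`, the list-of-tuples representation (oldest head first, so the last element is $h_0$), and the example words and labels are all hypothetical and not taken from the model implementation.

```python
from typing import List, Tuple

# A head is a (headword, label) pair; the label is a non-terminal label
# or a POStag for a root-only tree.
Head = Tuple[str, str]


def adjoin_left(heads: List[Head], nt_label: str) -> List[Head]:
    """Join the two most recent heads; the new headword is taken from h_{-1}."""
    *older, h_minus1, _h0 = heads
    return older + [(h_minus1[0], nt_label)]


def adjoin_right(heads: List[Head], nt_label: str) -> List[Head]:
    """Join the two most recent heads; the new headword is taken from h_0."""
    *older, _h_minus1, h0 = heads
    return older + [(h0[0], nt_label)]


# Hypothetical exposed heads, oldest first: h_{-2}, h_{-1}, h_0.
example_heads = [("<s>", "SB"), ("the", "DT"), ("contract", "NN")]
print(adjoin_right(example_heads, "NP"))
```

In both cases the former $h_{-2}$ becomes the new $h_{-1}$, which matches the relation $h'_{-1} = h_{-2}$ shown for adjoin-right.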

2.1. Probabilistic Model 

The probability P(W, T) of a word sequence W and a complete 
parse T can be broken into: 

$$P(W,T) = \prod_{k=1}^{n+1} \Big[ P(w_k / W_{k-1}T_{k-1}) \cdot P(t_k / W_{k-1}T_{k-1}, w_k) \cdot \prod_{i=1}^{N_k} P(p_i^k / W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) \Big] \qquad (1)$$



where: 

• $W_{k-1}T_{k-1}$ is the word-parse (k-1)-prefix

• $w_k$ is the word predicted by WORD-PREDICTOR

• $t_k$ is the tag assigned to $w_k$ by the TAGGER

• $N_k - 1$ is the number of operations the PARSER executes
at sentence position k before passing control to the
WORD-PREDICTOR (the $N_k$-th operation at position k is the null
transition); $N_k$ is a function of $T$

• $p_i^k$ denotes the i-th PARSER operation carried out at position k
in the word string; the operations performed by the PARSER are
illustrated in Figures 2-3 and they ensure that all possible binary
branching parses with all possible headword and non-terminal
label assignments for the $w_1 \ldots w_k$ word sequence can be
generated.

Our model is based on three probabilities, estimated using 
deleted interpolation parameterized as follows: 



$$P(w_k / W_{k-1}T_{k-1}) = P(w_k / h_0, h_{-1}) \qquad (2)$$
$$P(t_k / w_k, W_{k-1}T_{k-1}) = P(t_k / w_k, h_0.tag, h_{-1}.tag) \qquad (3)$$
$$P(p_i^k / W_k T_k) = P(p_i^k / h_0, h_{-1}) \qquad (4)$$

It is worth noting that if the binary branching structure developed
by the parser were always right-branching and we mapped
the POStag and non-terminal label vocabularies to a single type,
then our model would be equivalent to a trigram language model.
Since the number of parses for a given word prefix $W_k$ grows
exponentially with k, $|\{T_k\}| \sim O(2^k)$, the state space of our
model is huge even for relatively short sentences, so we had
to use a search strategy that prunes it. Our choice was a
synchronous multi-stack search algorithm which is very similar to a
beam search.

The probability assignment for the word at position k + 1 in 
the input sentence is made using: 

$$P_{SLM}(w_{k+1} / W_k) = \sum_{T_k \in S_k} P(w_{k+1} / W_k T_k) \cdot \rho(W_k, T_k),$$
$$\rho(W_k, T_k) = P(W_k T_k) \Big/ \sum_{T_k \in S_k} P(W_k T_k) \qquad (5)$$

which ensures a proper probability over strings $W^*$, where $S_k$
is the set of all parses present in our stacks at the current stage k.
An N-best EM [Q] variant is employed to reestimate the model 
parameters such that the PPL on training data is decreased — 
the likelihood of the training data under our model is increased. 
The reduction in PPL is shown experimentally to carry over to 
the test data. 
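The word-level probability in (5) is a weighted average of the per-parse predictions, with weights obtained by normalizing the parse probabilities over the stack. A minimal Python sketch follows; the function name `slm_word_prob`, the representation of the stack as (parse, log-probability) pairs, and the callback `word_given_parse` are assumptions made for illustration, and the numbers in the usage example are made up.

```python
import math
from typing import Callable, Hashable, List, Tuple


def slm_word_prob(
    stack: List[Tuple[Hashable, float]],
    word_given_parse: Callable[[str, Hashable], float],
    word: str,
) -> float:
    """Return P_SLM(word / W_k) as in (5).

    stack holds (parse, log P(W_k T_k)) pairs for the parses in S_k;
    word_given_parse(word, parse) returns P(word / W_k T_k).
    """
    # Normalize in the log domain to avoid underflow on long prefixes.
    max_logp = max(logp for _, logp in stack)
    weights = [math.exp(logp - max_logp) for _, logp in stack]
    total = sum(weights)
    return sum(
        (weight / total) * word_given_parse(word, parse)
        for (parse, _), weight in zip(stack, weights)
    )


# Made-up example: two parses with joint probabilities 0.03 and 0.01.
example_stack = [("T1", math.log(0.03)), ("T2", math.log(0.01))]
example_table = {("a", "T1"): 0.2, ("a", "T2"): 0.6}
print(slm_word_prob(example_stack, lambda w, t: example_table[(w, t)], "a"))
```

Because the weights sum to one over the parses in the stack, the result remains a proper distribution over the next word whenever each per-parse predictor is.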

3. A* DECODER FOR LATTICES 
3.1. A* Algorithm 

The A* algorithm ^ is a tree search strategy that could be com- 
pared to depth-first tree- traversal: pursue the most promising 
path as deeply as possible. 

To be more specific, let a set of hypotheses
$L = \{h : x_1, \ldots, x_n\}$, $x_i \in W^*$, to be scored using the
function $f(\cdot)$, be organized as a prefix tree. We wish to obtain
the hypothesis $h^* = \arg\max_{h \in L} f(h)$ without scoring all the
hypotheses in $L$, if possible with a minimal computational effort.

To be able to pursue the most promising path, the algorithm
needs to evaluate the possible continuations of a given prefix
$x = w_1, \ldots, w_p$ that reach the end of the lattice. Let $C_L(x)$ be
the set of complete continuations of $x$ in $L$; they all reach the
end of the lattice, see Figure 4. Assume we have an overestimate
$g(x.y) = f(x) + h(y|x) \geq f(x.y)$ for the score of the complete
hypothesis $x.y$, where "." denotes concatenation; imposing that
$h(y|x) = 0$ for empty $y$, we have $g(x) = f(x)$, $\forall$ complete $x \in L$. This
means that the quantity defined as:

$$g_L(x) = \max_{y \in C_L(x)} g(x.y) = f(x) + h_L(x), \qquad (6)$$
$$h_L(x) = \max_{y \in C_L(x)} h(y|x) \qquad (7)$$

is an overestimate of the most promising complete continuation
of $x$ in $L$: $g_L(x) \geq f(x.y)$, $\forall y \in C_L(x)$, and that $g_L(x) = f(x)$,
$\forall$ complete $x \in L$.

Figure 4. Prefix Tree Organization of a Set of Hypotheses

The A* algorithm uses a potentially infinite stack (see footnote 1) in which
prefixes $x$ are ordered in decreasing order of $g_L(x)$; at each
extension step the top-most prefix $x = w_1, \ldots, w_p$ is popped from
the stack, expanded with all possible one-symbol continuations
of $x$ in $L$, and then all the resulting expanded prefixes (among
which there may be complete hypotheses as well) are inserted
back into the stack. The stopping condition is: whenever the
popped hypothesis is a complete one, retain that one as the
overall best hypothesis $h^*$.

1 The stack need not be larger than $|L| = n$
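The stack discipline described above can be sketched with a priority queue. The code below is a generic illustration, not the decoder used in the experiments: the function name `a_star` and the callbacks `expand`, `is_complete` and `g_L` are hypothetical, and the stack is unbounded.

```python
import heapq
from typing import Callable, Iterable, Optional, Tuple

Prefix = Tuple[str, ...]


def a_star(
    start: Prefix,
    expand: Callable[[Prefix], Iterable[Prefix]],
    is_complete: Callable[[Prefix], bool],
    g_L: Callable[[Prefix], float],
) -> Optional[Prefix]:
    """Pop the prefix with the highest g_L, expand it, push the results back.

    The first complete hypothesis popped is returned as h*.
    """
    counter = 0  # insertion counter, used only to break score ties
    # heapq is a min-heap, so scores are negated to pop the largest g_L first.
    stack = [(-g_L(start), counter, start)]
    while stack:
        _, _, prefix = heapq.heappop(stack)
        if is_complete(prefix):
            return prefix
        for child in expand(prefix):
            counter += 1
            heapq.heappush(stack, (-g_L(child), counter, child))
    return None
```

If $g_L$ really is an overestimate that equals $f$ on complete hypotheses, the first complete hypothesis popped cannot be beaten by anything still on the stack, which is why the stopping condition is safe.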

3.2. A* for Lattice Decoding 

There are a couple of reasons that make A* appealing for our 
problem: 

• the algorithm operates with whole prefixes x, making it ideal 
for incorporating language models whose memory is the entire 
prefix; 

• a reasonably good overestimate $h(y|x)$ and an efficient way to
calculate $h_L(x)$ (see Eq. 7) are readily available using the n-gram
model, as we will explain later.

The lattices we work with retain the following information af- 
ter the first pass: 

• time-alignment of each node; 

• for each link connecting two nodes in the lattice we retain: 
word identity, acoustic model score and n-gram language model 
score. The lattice has a unique starting and ending node, respec- 
tively. 

A lattice can be conceptually organized as a prefix tree of 
paths. When rescoring the lattice using a different language 
model than the one that was used in the first pass, we seek to 
find the complete path $p = l_0 \ldots l_n$ maximizing:

$$f(p) = \sum_{i=0}^{n} \big[ \log P_{AM}(l_i) + LMweight \cdot \log P_{LM}(w(l_i) | w(l_0) \ldots w(l_{i-1})) - \log P_{IP} \big] \qquad (8)$$

where:

• $\log P_{AM}(l_i)$ is the acoustic model log-likelihood assigned to
link $l_i$;

• $\log P_{LM}(w(l_i) | w(l_0) \ldots w(l_{i-1}))$ is the language model
log-probability assigned to link $l_i$ given the previous links on the
partial path $l_0 \ldots l_i$;

• $LMweight > 0$ is a constant weight which multiplies the
language model score of a link; its theoretical justification is
unclear but experiments show its usefulness;

• $\log P_{IP} > 0$ is the "insertion penalty"; again, its theoretical
justification is unclear but experiments show its usefulness.
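As a concrete reading of (8), the score of a complete path is a per-link sum of three terms. The sketch below assumes the per-link acoustic and language model log-probabilities have already been computed; the function name `path_score` and its argument names are hypothetical, and the numbers in the usage example are made up.

```python
from typing import Sequence


def path_score(
    am_logprobs: Sequence[float],
    lm_logprobs: Sequence[float],
    lm_weight: float,
    log_p_ip: float,
) -> float:
    """Score f(p) of a path as in (8), one (AM, LM) log-probability pair per link."""
    if len(am_logprobs) != len(lm_logprobs):
        raise ValueError("need one AM and one LM log-probability per link")
    return sum(
        am + lm_weight * lm - log_p_ip
        for am, lm in zip(am_logprobs, lm_logprobs)
    )


# Made-up two-link path.
print(path_score([-10.0, -20.0], [-1.0, -2.0], lm_weight=12.0, log_p_ip=10.0))
```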

To be able to apply the A* algorithm we need to find an
appropriate stack entry scoring function $g_L(x)$, where $x$ is a
partial path and $L$ is the set of complete paths in the lattice.
Going back to the definition (6) of $g_L(x)$, we need an overestimate
$g(x.y) = f(x) + h(y|x) \geq f(x.y)$ for all possible complete
continuations $y = l_k \ldots l_n$ of $x$ allowed by the lattice. We propose
to use the heuristic:

$$h(y|x) = \sum_{i=k}^{n} \big[ \log P_{AM}(l_i) + LMweight \cdot (\log P_{NG}(l_i) + \log P_{COMP}) - \log P_{IP} \big] + LMweight \cdot \log P_{FINAL} \cdot \delta(k < n) \qquad (9)$$

A simple calculation shows that if

$$\log P_{NG}(l_i) + \log P_{COMP} \geq \log P_{LM}(l_i), \forall l_i$$

is satisfied then $g_L(x) = f(x) + \max_{y \in C_L(x)} h(y|x)$ is an
appropriate choice for the A* search.

The justification for the $\log P_{COMP}$ term is that it is supposed
to compensate for the per-word difference in log-probability
between the n-gram model NG and the superior model LM with
which we rescore the lattice; hence $\log P_{COMP} > 0$. Its
expected value can be estimated from the difference in perplexity
between the two models LM and NG. The $\log P_{FINAL} > 0$
term is used for practical considerations as explained in the next
section.

The calculation of $g_L(x)$ (6) is made very efficient after
realizing that one can use the dynamic programming technique
in the Viterbi algorithm. Indeed, for a given lattice $L$,
the value of $h_L(x)$ is completely determined by the identity
of the ending node of $x$; a Viterbi backward pass over the
lattice can store at each node the corresponding value of
$h_L(x) = h_L(ending\_node(x))$ such that it is readily available in the A*
search.
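The backward pass can be sketched as a dynamic program over the lattice nodes in reverse topological order. The code below is a simplified illustration under stated assumptions: the function name `backward_heuristic`, the link tuple layout (start node, end node, acoustic log-probability, n-gram log-probability) and the toy lattice are hypothetical, and the $\log P_{FINAL}$ term of (9) is left out for brevity.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

LinkTuple = Tuple[Hashable, Hashable, float, float]  # (start, end, AM logprob, NG logprob)


def backward_heuristic(
    links: Sequence[LinkTuple],
    topo_order: Sequence[Hashable],
    end_node: Hashable,
    lm_weight: float,
    log_p_comp: float,
    log_p_ip: float,
) -> Dict[Hashable, float]:
    """Best compensated n-gram score from each node to the end of the lattice."""
    outgoing: Dict[Hashable, List[Tuple[Hashable, float, float]]] = defaultdict(list)
    for start, end, am, ng in links:
        outgoing[start].append((end, am, ng))

    best: Dict[Hashable, float] = {end_node: 0.0}  # empty continuation scores 0
    for node in reversed(topo_order):
        if node == end_node:
            continue
        candidates = [
            am + lm_weight * (ng + log_p_comp) - log_p_ip + best[end]
            for end, am, ng in outgoing[node]
            if end in best  # skip successors that cannot reach the end node
        ]
        if candidates:
            best[node] = max(candidates)
    return best


# Toy lattice with two paths from node 0 to node 3.
toy_links = [(0, 1, -1.0, -1.0), (0, 2, -1.5, -0.5), (1, 3, -1.0, -1.0), (2, 3, -1.0, -1.0)]
print(backward_heuristic(toy_links, [0, 1, 2, 3], 3, 2.0, 0.5, 1.0))
```

Each node is visited once and each link is examined once, so the cost of the pass is linear in the size of the lattice.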

3.3. Some Practical Considerations 

In practice one cannot maintain a potentially infinite stack. We 
chose to control the stack depth using two thresholds: one on 
the maximum number of entries in the stack, called stack-depth- 
threshold and another one on the maximum log-probability dif- 
ference between the top most and the bottom most hypotheses in 
the stack, called stack-logP -threshold. 
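The two thresholds can be applied together as a pruning step after each extension. The sketch below is one plausible reading, not the authors' implementation: the function name `prune_stack`, the (score, hypothesis) entry format and the order in which the two thresholds are applied are assumptions.

```python
from typing import List, Tuple

Entry = Tuple[float, str]  # (stack score g_L(x), hypothesis identifier)


def prune_stack(
    entries: List[Entry],
    depth_threshold: int,
    logp_threshold: float,
) -> List[Entry]:
    """Keep the best entries, bounded both in number and in distance from the top."""
    ordered = sorted(entries, key=lambda entry: entry[0], reverse=True)
    if not ordered:
        return []
    top_score = ordered[0][0]
    # Drop entries whose score is too far below the top-most hypothesis.
    within_beam = [e for e in ordered if top_score - e[0] <= logp_threshold]
    # Then cap the number of entries.
    return within_beam[:depth_threshold]


# Made-up stack contents.
print(prune_stack([(-1.0, "a"), (-5.0, "b"), (-20.0, "c"), (-2.0, "d")], 2, 10.0))
```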

A gross overestimate used in connection with a finite stack
may lure the search toward a suboptimal cluster of paths:
the desired cluster of paths may fall out of the stack if the
overestimate happens to favor a wrong cluster.

Also, longer partial paths (which have shorter suffixes)
benefit less from the per-word $\log P_{COMP}$ compensation,
which means that they may fall out of a stack already full
with shorter hypotheses, which have high scores due to
compensation. This is the justification for the $\log P_{FINAL}$
term in the compensation function $h(y|x)$: the variance
$var[\log P_{LM}(l_i | l_0 \ldots l_{i-1}) - \log P_{NG}(l_i)]$ is a finite
positive quantity, so the compensation is likely to be closer to
the expected value $E[\log P_{LM}(l_i | l_0 \ldots l_{i-1}) - \log P_{NG}(l_i)]$
for longer $y$ continuations than for shorter ones; introducing
a constant $\log P_{FINAL}$ term is equivalent to an adaptive
$\log P_{COMP}$ depending on the length of the $y$ suffix, with a
smaller equivalent $\log P_{COMP}$ for long suffixes $y$, for which
$E[\log P_{LM}(l_i | l_0 \ldots l_{i-1}) - \log P_{NG}(l_i)]$ is a better estimate for
$\log P_{COMP}$ than it is for shorter ones.

Because the structured language model is computationally
expensive, a strong limitation is placed on the width
of the search, controlled by the stack-depth-threshold and the
stack-logP-threshold. For an acceptable search width (and hence
runtime) one seeks to tune the compensation parameters
to maximize performance measured in terms of WER. However,
the correlation between these parameters and the WER is not
clear, which makes diagnosing search problems extremely difficult.
Our method for choosing the search and compensation parameters
was to sample a few complete paths $p_1, \ldots, p_N$ from each
lattice, rescore those paths according to the $f(\cdot)$ function (8) and
then rank the $h^*$ path output by the A* search among the
sampled paths. A correct A* search should result in average rank
0. In practice this doesn't happen, but one can trace the topmost
path $p^*$ (in the offending cases $p^* \neq h^*$ and $f(p^*) > f(h^*)$)
and check whether the search failed strictly because of
insufficient compensation (a prefix of the $p^*$ hypothesis is present
in the stack when A* returns) or because the path $p^*$ fell out
of the stack during the search, in which case the compensation
and the search-width interact.
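The rank diagnostic can be written down directly: the rank of $h^*$ is the number of sampled paths whose rescored value of $f(\cdot)$ is strictly higher. The helper names `rank_among_samples` and `average_rank` below are hypothetical, and the scores in the usage example are made up.

```python
from typing import Iterable, Sequence, Tuple


def rank_among_samples(h_star_score: float, sampled_scores: Iterable[float]) -> int:
    """Number of sampled paths scoring strictly higher than the A* output."""
    return sum(1 for score in sampled_scores if score > h_star_score)


def average_rank(per_lattice: Sequence[Tuple[float, Sequence[float]]]) -> float:
    """Average rank over lattices; each item is (f(h*), scores of sampled paths)."""
    ranks = [rank_among_samples(h, samples) for h, samples in per_lattice]
    return sum(ranks) / len(ranks)


# Made-up scores for two lattices: rank 2 in the first, rank 0 in the second.
print(average_rank([(-10.0, [-12.0, -9.0, -10.0, -8.0]), (-5.0, [-6.0, -7.0])]))
```

A rank of 0 for every lattice means no sampled path beat the A* output, which is the behavior expected of a search that makes no errors on the sampled set.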

The method we chose for sampling paths from the lattice was 
an N-best search using the n-gram language model scores; this is 
appropriate for pragmatic reasons — one prefers lattice rescor- 
ing to N-best list rescoring exactly because of the possibility to 
extract a path that is not among the candidates proposed in the 
N-best list — as well as practical reasons — they are among the 
"better" paths in terms of WER. 

4. EXPERIMENTS 
4.1. Experimental Setup 

In order to train the structured language model (SLM) as de- 
scribed in 1^1 we need parse trees from which to initialize the 
parameters of the model. Fortunately a part of the Switchboard 
(SWB) n™ data has been manually parsed at UPenn ; let us re- 
fer to this corpus as the SWB-Treebank. The SWB training data 



used for speech recognition — SWB-CSR — is different from 
the SWB-Treebank in two aspects: 

• the SWB-Treebank is a subset of the SWB-CSR data; 

• the SWB-Treebank tokenization is different from that of the 
SWB-CSR corpus; among other spurious small differences, the 
most frequent ones are of the type presented in Table 1.

SWB-Treebank    SWB-CSR
do n't          don't
it 's           it's
i 'm            i'm
i 'll           i'll

Table 1. SWB-Treebank SWB-CSR tokenization mismatch

Our goal is to train the SLM on the SWB-CSR corpus. 

4.1.1. Training Setup 

The training of the SLM model proceeded as follows: 

• train SLM on SWB-Treebank, using the SWB-Treebank
closed vocabulary — as described in |0]; this is possible because 
for this data we have parse trees from which we can gather initial 
statistics; 

• process the SWB-CSR training data to bring it closer to the
SWB-Treebank format. We applied the transformations suggested
by Table 1; the resulting corpus will be called SWB-CSR-Treebank,
although at this stage we only have words and no parse
trees for it;

• transfer the SWB-Treebank parse trees onto the SWB-CSR- 
Treebank training corpus. To do so we parsed the SWB-CSR- 
Treebank using the SLM trained on the SWB-Treebank; the vo- 
cabulary for this step was the union between the SWB-Treebank 
and the SWB-CSR-Treebank closed vocabularies; at this stage 
SWB-CSR-Treebank is truly a "treebank"; 

• retrain the SLM on the SWB-CSR-Treebank training corpus 
using the parse trees obtained at the previous step for gathering 
initial statistics; the vocabulary used at this step was the SWB- 
CSR-Treebank closed vocabulary. 

4.1.2. Lattice Decoding Setup 

To be able to run lattice decoding experiments we need to
bring the lattices, which use the SWB-CSR tokenization, to the
SWB-CSR-Treebank format. The only operation involved in this
transformation is splitting certain words into two parts, as suggested
by Table 1. Each link whose word needs to be split is cut into
two parts and an intermediate node is inserted into the lattice as
shown in Figure 5. The acoustic and language model scores of
the initial link are copied onto the second new link. For all the
decoding experiments we have carried out, the WER is measured
after undoing the transformations highlighted above; the
reference transcriptions for the test data were not touched and the
NIST SCLITE package was used for measuring the WER.

Figure 5. Lattice Processing: a link between a start node (s_time) and an end node (e_time) is split at an intermediate node i; the second new link carries w_2, AMlnprob, NGlnprob
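The link-splitting step can be illustrated as follows. The text only states that the scores of the initial link are copied onto the second new link; giving the first new link zero log-scores is an assumption made here so that the path score is unchanged, and the `Link` class, the `SPLITS` table (built from Table 1) and the function `split_link` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# SWB-CSR token -> SWB-Treebank tokens, following Table 1.
SPLITS: Dict[str, Tuple[str, str]] = {
    "don't": ("do", "n't"),
    "it's": ("it", "'s"),
    "i'm": ("i", "'m"),
    "i'll": ("i", "'ll"),
}


@dataclass(frozen=True)
class Link:
    start: str
    end: str
    word: str
    am_logprob: float
    ng_logprob: float


def split_link(link: Link, new_node: str) -> List[Link]:
    """Split a link whose word needs re-tokenization; other links are returned as-is."""
    parts = SPLITS.get(link.word)
    if parts is None:
        return [link]
    first, second = parts
    return [
        # Assumption: the first new link carries zero log-scores.
        Link(link.start, new_node, first, 0.0, 0.0),
        # As stated in the text: the original scores go onto the second new link.
        Link(new_node, link.end, second, link.am_logprob, link.ng_logprob),
    ]


print(split_link(Link("s", "e", "don't", -50.0, -3.0), "i"))
```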

4.2. Perplexity Results 

As a first step we evaluated the perplexity performance of the 
SLM relative to that of a deleted interpolation 3-gram model 
trained in the same conditions. We worked on the SWB-CSR- 
Treebank corpus. The size of the training data was 2.29 Mwds; 



the size of the test data set aside for perplexity measurements 
was 28 Kwds — WS97 DevTest [fl|. We used a closed vocab- 
ulary — test set words included in the vocabulary — of size 
22Kwds. Similar to the experiments reported in [|], we built 
a deleted interpolation 3-gram model which was used as a base- 
line; we have also linearly interpolated the SLM with the 3-gram 
baseline showing a modest reduction in perplexity: 

$$P(w_i | W_{i-1}) = \lambda \cdot P(w_i | w_{i-1}, w_{i-2}) + (1 - \lambda) \cdot P_{SLM}(w_i | W_{i-1})$$

The results are presented in Table 2.
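For readers who want to reproduce this kind of comparison, linear interpolation and perplexity can be computed as sketched below. The function names `interpolate` and `perplexity` are hypothetical and the probabilities in the usage example are made up; they are not the values behind Table 2.

```python
import math
from typing import Sequence


def interpolate(p_3gram: float, p_slm: float, lam: float) -> float:
    """Linear interpolation: lam weights the 3-gram, (1 - lam) weights the SLM."""
    return lam * p_3gram + (1.0 - lam) * p_slm


def perplexity(word_probs: Sequence[float]) -> float:
    """Perplexity of a word sequence given the probability assigned to each word."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Made-up per-word probabilities from the two models.
three_gram = [0.20, 0.05, 0.10]
slm = [0.10, 0.15, 0.10]
mixed = [interpolate(a, b, 0.4) for a, b in zip(three_gram, slm)]
print(perplexity(mixed))
```

With this convention, $\lambda = 1.0$ reduces to the 3-gram baseline and $\lambda = 0.0$ to the SLM alone.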



                       L2R Perplexity
                       DEV set             TEST set
Language Model         λ=0.0    λ=1.0      λ=0.0    λ=0.4    λ=1.0
3-gram + Initl SLM     23.9     22.5       72.1     65.8     68.6
3-gram + Reest SLM     22.7     22.5       71.0     65.4     68.6

Table 2. Perplexity Results

4.3. Lattice Decoding Results

We proceeded to evaluate the WER performance of the SLM
using the A* lattice decoder described previously. Before
describing the experiments we need to make one point clear: there
are two 3-gram language model scores associated with each
link in the lattice:

• the language model score assigned by the model that generated 
the lattice, referred to as the LAT3-gram; this model operates on 
text in the SWB-CSR tokenization; 

• the language model score assigned by rescoring each link in 
the lattice with the deleted interpolation 3-gram built on the data 
in the SWB-CSR-Treebank tokenization, referred to simply as 
the 3-gram — used in the experiments reported in the previous 
section. 

The perplexity results show that interpolation with the 3-gram 
model is beneficial for our model. Note that the interpolation: 

$$P(l) = \lambda \cdot P_{LAT3-gram}(l) + (1 - \lambda) \cdot P_{SLM}(l)$$

between the LAT3-gram model and the SLM is illegitimate due 
to the tokenization mismatch. 

As explained previously, because the SLM's
memory extends over the entire prefix we need to
apply the A* algorithm to find the overall best path in the
lattice. The parameters controlling the A* search were
set to: $\log P_{COMP} = 0.5$, $\log P_{FINAL} = 2$, $LMweight = 12$,
$\log P_{IP} = 10$, stack-depth-threshold = 30,
stack-logP-threshold = 10 (see (8) and (9)).
The parameters controlling the SLM were the same as in pp. The 
results for different interpolation coefficient values are shown in
Table 3.



                       WER (%)
                       A* search                     Viterbi search
Language Model         λ=0.0    λ=0.4    λ=1.0       λ=1.0
LAT-3gram + SLM        42.4     40.3     41.6        41.3

Table 3. Lattice Decoding Results

The structured language model achieved an absolute improve- 
ment of 1% WER over the baseline; the improvement is statisti- 
cally significant at the 0.002 level according to a sign test. In the 
3-gram case, the A* search loses 0.3% over the Viterbi search
due to the finite stack and the heuristic lookahead.
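A sign test of the kind mentioned above compares two systems sentence by sentence and counts how often each one wins, ignoring ties. The sketch below assumes a two-sided exact binomial test, which may differ from the variant actually used; the function name `sign_test_p_value` is hypothetical and the counts in the usage example are made up, not the counts from this experiment.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test p-value under the null of equal win probability."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of seeing k or fewer of the rarer outcome under Binomial(n, 1/2).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Made-up counts: system A better on 8 sentences, system B better on 2.
print(sign_test_p_value(8, 2))
```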

4.4. Search Evaluation Results 

For tuning the search parameters we have applied the N-best
lattice sampling technique described in Section 3.3. As a
by-product, the WER performance of the structured language model
on N-best list rescoring (N = 25) was 40.9%. The average
rank of the hypothesis found by the A* search among the N-best
ones, after rescoring them using the structured language model
interpolated with the trigram, was 1.07 (minimum achievable
value is 0). There were 585 offending sentences, out of a total
of 2427 test sentences, in which the A* search led to a
hypothesis whose score was lower than that of the top hypothesis
among the N-best (1-best). In 310 cases the prefix of the
rescored 1-best was still in the stack when A* returned
(inadequate compensation) and in the other 275 cases the 1-best
hypothesis was lost during the search due to the finite stack size.

One interesting experimental observation was that even 
though in the 585 offending cases the score of the 1-best was 
higher than that of the hypothesis found by A*, the WER of 
those hypotheses — as a set — was higher than that of the set of 
A* hypotheses. 

5. CONCLUSIONS 

Similar experiments on the Wall Street Journal corpus are re- 
ported in [H] showing that the improvement holds even when the 
WER is much lower. 

We believe we have presented an original approach to lan- 
guage modeling that takes into account the hierarchical structure 
in natural language. Our experiments showed improvement in 
both perplexity and word error rate over current language mod- 
eling techniques demonstrating the usefulness of syntactic struc- 
ture for improved language models. 